Efficient Parsing for Information Extraction
Authors
Abstract
Several successful Information Extraction (IE) systems have recently replaced their core parsing components with shallow but more efficient recognizers. In this paper we argue that the absence of an underlying grammatical recognizer, given the complex nature of several (non-English) languages, is a strong limitation for text processing functionalities like those an IE system needs. We propose a robust and efficient syntactic recognizer aimed mainly at capturing grammatical information crucial for several linguistic and non-linguistic inferences. The proposed system is based on a novel architecture exploiting two major principles: lexicalization and stratification of the parsing process. As several linguistic theories (e.g. HPSG) and parsing frameworks (e.g. LTAG, SLTAG, lexicalized probabilistic parsing) suggest, lexicon-driven systems provide suitable forms of grammatical control for many complex phenomena. In our system, an analysis guided by information on typical verb projections (e.g. verb subcategorization structures) is coupled with extended locality constraints (i.e. recognition of clause boundaries). Stratification is also employed: a cascade of processing steps starts from chunk recognition and proceeds through clause analysis to dependency detection. Recognizing chunks first minimizes the input ambiguity passed to the remaining phases. The resulting system is thus robust against ungrammatical phenomena (e.g. complex clause embedding, misspellings, unknown words). Efficiency is also retained, although ambiguous phenomena (multiple PP attachments) are recognized.

1 Introduction

Several successful IE systems have recently replaced their core parsing components with shallow but more efficient recognizers [1, 8]. However, the absence of a grammatical recognizer, given the complex nature of several (non-English) languages, is a strong limitation for text processing functionalities like those an IE system needs.
Consider the following sentence, extracted and translated from a financial corpus in Italian (see the Appendix for the Italian version):

Assuming to have at disposal a certain budget level for an environmental recovery action, ACE s.p.a. intends to prepare the necessary plan to coordinate the following work activities, which will end in the completion of the operational implementation project.

The sentence exhibits a complex but very common structure in Italian texts. Typical information to be extracted from it is the named organization (i.e., Ace), the type of intended activity (i.e., environmental recovery) and a variety of specifications and participants to the core event. For example, understanding the intended action of Ace implies recognizing the causative/agent role of Ace itself in the subordinate clause. …
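The stratified cascade described in the abstract (chunk recognition, then clause analysis, then dependency detection guided by verb subcategorization) can be illustrated with a toy sketch. This is not the authors' system: the tag set, the `SUBCAT` lexicon, the chunk labels, the `pp?` relation name and all function names are assumptions made up for illustration, and the rules are far simpler than any real grammar.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: each verb lists the argument slots it projects.
SUBCAT = {"prepares": ["subj", "obj"]}
NOMINAL = ("DET", "ADJ", "NOUN")  # illustrative coarse tags

@dataclass(frozen=True)
class Chunk:
    kind: str  # "NP", "VP", "PP" or "SEP" (illustrative labels)
    head: str

def _nominal_end(tagged, i):
    """Return the index just past the run of nominal tags starting at i."""
    while i < len(tagged) and tagged[i][1] in NOMINAL:
        i += 1
    return i

def chunk(tagged):
    """Stage 1: collapse (word, tag) pairs into flat chunks."""
    chunks, i = [], 0
    while i < len(tagged):
        word, tag = tagged[i]
        start = i + 1 if tag == "PREP" else i
        end = _nominal_end(tagged, start)
        if end > start:
            # The last word of the nominal run is taken as the chunk head.
            kind = "PP" if tag == "PREP" else "NP"
            chunks.append(Chunk(kind, tagged[end - 1][0]))
            i = end
        elif tag == "VERB":
            chunks.append(Chunk("VP", word))
            i += 1
        else:
            # Anything else (punctuation, unknown tags) marks a boundary.
            chunks.append(Chunk("SEP", word))
            i += 1
    return chunks

def split_clauses(chunks):
    """Stage 2: cut the chunk sequence at boundary chunks."""
    clauses, current = [], []
    for c in chunks:
        if c.kind == "SEP":
            if current:
                clauses.append(current)
            current = []
        else:
            current.append(c)
    if current:
        clauses.append(current)
    return clauses

def dependencies(clause):
    """Stage 3: lexicon-driven dependencies inside one clause."""
    deps = []
    for i, c in enumerate(clause):
        if c.kind == "VP":
            slots = SUBCAT.get(c.head, [])  # unknown verbs license nothing
            left = [x for x in clause[:i] if x.kind == "NP"]
            right = [x for x in clause[i + 1:] if x.kind == "NP"]
            if "subj" in slots and left:
                deps.append((c.head, "subj", left[-1].head))
            if "obj" in slots and right:
                deps.append((c.head, "obj", right[0].head))
        elif c.kind == "PP":
            # PP attachment is left ambiguous: the nearest preceding verb
            # and every NP after it are all kept as candidate heads.
            start = max((k for k in range(i) if clause[k].kind == "VP"), default=0)
            for x in clause[start:i]:
                if x.kind in ("NP", "VP"):
                    deps.append((x.head, "pp?", c.head))
    return deps

SENT = [("the", "DET"), ("company", "NOUN"), ("prepares", "VERB"),
        ("the", "DET"), ("plan", "NOUN"), ("for", "PREP"), ("the", "DET"),
        ("recovery", "NOUN"), (",", "PUNCT"), ("which", "REL"), ("ends", "VERB")]

CLAUSES = split_clauses(chunk(SENT))
for clause in CLAUSES:
    print(dependencies(clause))
```

In this sketch each stage consumes only the output of the previous one, so the dependency step never sees word-level ambiguity. A verb missing from the lexicon simply yields no argument links rather than a failure, and both candidate attachments of the PP are reported instead of one being chosen.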
Similar resources
Automatic Semantic Role Labeling in Persian Sentences Using Dependency Trees
Automatically identifying the words that bear semantic roles (such as Agent, Patient, Source, etc.) in a sentence and assigning the correct roles to them may improve many natural language processing tasks, including information extraction, question answering, text summarization and machine translation. Semantic role labeling systems usually take advantage of syntactic parsing and th...
A Comparative Study of the Effect of Part-of-Speech Tagging on Parsing in Automatic Processing of the Persian Language
In this paper, the role of Part-of-Speech (POS) tagging for parsing in automatic processing of the Persian language is studied. To this end, the impact of the quality of POS tagging as well as the impact of the quantity of information available in the POS tags on parsing are studied. To reach the goals, three parsing scenarios are proposed and compared. In the first scenario, the parser assigns...
Automatic Term Identification and Classification in Biology Texts
The rapid growth of collections in online academic databases has meant that there is increasing difficulty for experts who want to access information in a timely and efficient way. We seek here to explore the application of information extraction methods to the identification and classification of terms in biological abstracts from MEDLINE. We explore the use of a statistical method and a decision tr...
An improved joint model: POS tagging and dependency parsing
Dependency parsing is a form of syntactic parsing in natural language processing that automatically analyzes the dependency structure of sentences, producing a dependency graph for each input sentence. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers do the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...
Feature extraction in opinion mining through Persian reviews
Opinion mining analyzes user reviews to extract their opinions, sentiments and demands in a specific area, which can play an important role in making major decisions in that area. In general, opinion mining extracts user reviews at three levels of document, sentence and feature. Opinion mining at the feature level is taken into consideration more than the other two levels d...
Parser Framework for Information Extraction
Various document parsing methods are required in applications that perform complex information extraction. The development of such parsing schemes can be simplified by decomposing the complex extraction process into simple steps that can be realized by elementary parser modules. The authors present a general framework for the development of IE applications by applying different parsers independentl...